Understanding Prosper Loan Data Set

Uirá Caiado. April 7, 2015


Abstract

I explore a data set about P2P loans. My goals with the study are understanding how this practice works, and what makes a loan to be bound to a specific risk score.

Introduction

I have to tell the truth. Initially, the number of variables intimidated me. There are 113,937 rows in this data set and 81 columns (variables). It seems to come from Prosper website, a peer-to-peer (P2P) lending marketplace. I have already seen a brazilian website trying to do something similar in my country, but our “SEC” shut them off. Just financial companies can lend money here.

According to wikipedia, the SEC imposed that Prosper had to cease their operations due to violations on a piece of legislation that ensures transparency and try to avoid fraudulent activities in the american securities markets. The website took from November 2008 to July 2009 to resume their activities.

Bellow you can see the effect of this time closed:

There are 6 variables in the data set just valid to loans originated after 2009 and I believe that they were created to comply regulatory rules. In general, they are all about how much interest have been payed and how risky each loan were. Looking at other variables, I can see that they are pretty diverse but they are all related to these two factors: risk and return.

There are variable explicit about risk measurements (what was the classification when the loan was took, the scores from Prosper), about how much was paid and earned, meta data of the loan (the size in money and duration, when and where it was originated), about the profile of the borrower (employment status, his income, his debts, what he intend to do with the money), how each loan was founded and so on.

As P2P lending is something new for me, I would like to understand how it works on United States and I will use this database to achieve that.

Analysis

Univariate Section

In Brazil, where I live, I would expect the total loans rise at some months of the year. Usually at the begining and the end of the year, when people travel, pay taxes and buy gifts. Let’s start looking at the amount of money borrowed by month in United States.

As expected, the amount of loans rises at first and last months of the year. Well, if there are more money borrowed on these periods, maybe the number of loans defaulted increases between them. To verify that, I have to use the ClosedDate variable and filter just the defaulted and chargedoff loans. Let’s see:

Curriously the distribution of the amount borrowed by month looks like the distribution of amount defaulted/chargedoff by month. Well, I am not sure if this filtered data is relevant, let me see how many data points there are in each status. I will group all ‘Past Due’ buckets as the same buckets and show off the amount of data on each status percentualy. As can be seem bellow, 11% of the data is related to defaulted or chargedoff loans:

## 
##              Cancelled   Chargedoff/Defaulted              Completed 
##                   0.00                   0.15                   0.33 
##                Current FinalPaymentInProgress          Past Due Date 
##                   0.50                   0.00                   0.02

Thinking again, nobody would borrow money to pay back in the next couple of months, at least not in Brazil. Let’s see how many loans, percentually, were borrowed by duration:

## 
##   12   36   60 
## 0.01 0.77 0.22

It explains why the distributions of Defaulted/Chargedoff loans and the amount borrowed by month looks like the same. As people take loans on regular durations, 12, 36 and 60 months, the loans will ever due to the same month in different years.

How about the median of the loan? First, let’s see the quartiles of the value of each loan:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Wow…$ 1,000 to $35,000. It is a huge range. But it seems that the most of the loans are smaller, given that 50 % of the data set is up to $ 6500. Let’s take a close look at the loan original amount split by duration. I had to convert the variable “Term” to factor to make the chart bellow:

It seems that the loan for each duration is something that I should not plot together. The distribution of each one is prety different.

Well, and how about the ineterest rate? The distribution differs at each term? First, let’s see the histogram of the interest paid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

There is something different on the right tail. Let’s reduce the bin width to 0.5%

## Warning: position_stack requires constant width: output may be incorrect

Definitely there is something different after 30% of annual interest rate. Let’s take a close look at the range between 30%-37%.

## Warning: position_stack requires constant width: output may be incorrect

Now, more specifically in the range 31.5% and 32%.

## 
##  0.316 0.3165 0.3166 0.3169  0.317 0.3174 0.3175 0.3176 0.3177 0.3178 
##    121     60      1      1      2      3      9      2   3672      1 
## 0.3179  0.318 0.3185 0.3188 0.3189  0.319 0.3195 0.3197 0.3198 0.3199 
##      2      1     27      5      2      4     12      1      3   1651

Curiously there is a huge concentration at 31.77% and 31.99%. Let me see when those loan were took.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

Ok, I can see that there is no evidence of relationship with date. Maybe with the category of the term:

## 
##   36 
## 5323

Interesting, all loans with these interest rates have a term of 36 months. Let`s see how the rates are distributed by Term.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The most of loans have a term of 36 months. It is curious to see that just loans with term of 36 months have a more dispersed distribution. However, it is difficult to see how the how loans with term of 12 months are distributed. Bellow are the Box plots.

As previously observed on the frequency polygons , the box plot confirms that 36-months loans are more dispersed. Additionally, I can notice that the median increases sharply from 12 to 36 months and then increases slightly to 60 months. It makes sense, once that a long-term loan is much more unpredictable than a 12 months loan.

Following my reasoning, if the borrower rate are dispersed, I would expectedthat the Risk Score also was dispersed, given that the interest rate should be a function of the risk score. First, let me see the quartiles of the Prosper Rating.

## 
##     1     2     3     4     5     6     7  <NA> 
##  6935  9795 14274 18345 15581 14551  5372 29084
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   4.072   5.000   7.000   29084

As I can see, the risk scores are discrete and ordered. I believe that it is not a issue, given that a averge rate of, for instance, 5.3, has a meaning. It tells me that there is more data point on 5 than in 6. I also notice that there are many data point without a risk score. Below is a histogram just including those that has a score setted:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Well…They are not dispersed….and, thinking again, actually they shouldn’t be. The risk score is discrete, there is no way to be dispersed. I would need verify the distribution splitting the data by term, but I will check it later. Maybe the income can help me understand something:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

75% of the incomes declared are less than or equal to $ 6,825 and the maximum income value is $ 1,750,000. If I look at the income range, I can see that it was grouped just until $100,000. Let’s look at data where the monthly income lessthen 100,000 and has a prosper score setted.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

It doesn’t add up much to my thoughts. I believe that these variables have to be related to another to be more meaningful. Now, let’s see why people borrow money, considering just who has a prosper rating setted.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

I need to rename the categories. Also, I am already seeing that there are a lot of categories that has less than 2,000 observations. I am going to rename those to ‘Other’.

Much better. The loans to roll over existing debts are predominant. Let’s see now how people were deep in debt when they took the loans using Prosper. Given that the loans labeled as Debt Consolidation are most common, I would expect that the ratio debt o income be greater.

## Warning: position_stack requires constant width: output may be incorrect

There are some outliers here. As it looks like that the most of data points are less than or equal to 1, I am going to filter out the debt/income ratio greater than that. Also, I will apply a log transformation here as it seems that there is a concentration in the middle of the distribution.

I am seeing that there is some kind of distribution here. The most of people have less than 30% of their income committed to debts. Also, I believe that there is no need to transform this variable in my future explorations.

So far, I have seen some evidences that people take more loans in specific months and the interest rate of these loans are strongly related to the term of the each loan. There is something curious about loan with 37%-38% of interest rate and I have not understood why loan with term of 36 month are so much more dispersed than 60 months, although the median of interest rates on 60 month are slightly greater than 36 months.

Also, I didn’t find nothing so interesting about risk score and income level, two variable that I was expecting more. On the other hand, maybe this variables have more to tell when related to others. I would like to understand better the spike of interest rate previously mentioned and why loans with term of 36-months are more dispersed than 60 months. Also I want to find out why risky loans are classified that way.

Bivariate Section

Let’s start looking at how risk score and borrower rate relate with each other. I’m expecting to find a strong relationship between them, after all, the interest rate should reflect the risk of the loan.

As the Prosper Rating is a discrete variable and the Borrower Rate variable is continuous, the chart looks like a bit cluttered. I can already notice that as better the risk score, more common are lower rates.

I am going to add some jitter on the chart and change the alpha to reduce the overplotting. I am also excluding monthly income greater than $ 100,000.

This chart shows that loans with higher interest rate are much more common in loans with higher risk (smaller scores). Let me see the correlation between these two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and BorrowerRate
## t = -854.5218, df = 75587, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9526073 -0.9512698
## sample estimates:
##        cor 
## -0.9519431

As expected, there is a strong negative relationship between the two variables, confirming that smaller scores results in higher interest rates. As I said before, It was expected. What I would like to understand is WHY this loans were classified as risky.

Let me see what is the relationship between risk score and monthly income. I am also filtering out monthly income equal to 0 (zero) in the text below.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and StatedMonthlyIncome
## t = 64.8236, df = 74894, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2236993 0.2372620
## sample estimates:
##       cor 
## 0.2304918

The test shows that the correlation is significant, given that the p-value is lower than 0.05 and indicates that as higher the monthly income, better is the risk score. On the other hand, the correlation coefficient, around 0.2, is not strong. It can be confirmed by the chart below.

There is a concentration around $5,000 dolars of income by month and between 3 to 6 risk scores, but I believe that it is more related to the typical american profile than a relationship between income and risk score. I am also seeing that even people with high income are classified as high risk. Maybe the correlation between the Prosper Rating and the Debt to Income ratio is stronger.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and DebtToIncomeRatio
## t = -33.9606, df = 69134, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1354213 -0.1207577
## sample estimates:
##        cor 
## -0.1280965

This test also shows that the correlation is significant and indicates that the debt/income ratio and the Prosper Rating are negatively correlated. Said another way, the higher the debt/income ratio, worst is the Prosper Rating. But, again, the correlation coefficient, around -0.1, is not strong. I wonder if there is any pair of variables, that is not Prosper Rating and Borrower Ratio, that presents a strong correlation between them.

Maybe there is something about the amount of the loan and Risk Score. I going to keep the same filters on the data that I used before.

It seems as some values of loans were allowed just for certain risk scores, but it does not add up to my understanding of Risk classification. Let me see how the loans for each risk score are divided by ListingCategory.

There are a lot of loans that were taken for Debt Consolidation on all Risk levels. If this listing category was predominant on all levels, it suggests to me that it is not a driver of risk level. Thus, I will not use this variable further on my analysis.

And how Debt to Income Ratio is related to Borrower Rate?

## Warning: Removed 7296 rows containing missing values (geom_point).

It is confuse. I am going to break up the Debt To Income Ratio into a categorical one and plot the two variables again. The most of classes I have defined as the quartiles of the data.

Nice, here is something useful. There is some relationship between the two variables and it is coherent. The more one is in debt, the more expensive is to take more loans.

Talking about debt to income ratio, let’s see the box plot of this variable related to Monthly Income. As I have seen before, there are big outliers here. Then, I am going to plot just income until $ 10,000 per month.

The level of debt is directly related to income. It corroborates with what I have found before. If people with bigger Debt to Income Ratio pay higher interest rates and this ratio is negatively related to monthly income, then, the bigger income, the smaller the interest rate.

If there is this relation with income, maybe there is some relationship with the assets of the borrower. Let me check the interest rate distribution when people are home owners or not. I will filter out borrower rate smaller than 39%

Here I also found some relationship. I guess It couldn’t be different . When you talk about risk, you talk about uncertainty. If you are confident that the borrower will pay the loan back, you can charge a lower interest rate. It is how things work between countries, makes sense that it works in the same way between people. If someone earns more money per month or he has a house, you might be more confident that he will have conditions to pay you back.

Well…I am talking over risk and uncertainty….but, first of all, loans classified as risky really are riskier? Let’s check the default ratio between risk scores.

## No id variables; using all as measure variables

Yes, they are. The number of borrowers that didn’t pay the loan back are greater at worst risk scores.

Ok… I want to check one last thing. Let’s see the distribution of borrower rate by month. In order to put into perspective, I am also going to plot the default ratio by month and the total number of loans by month.

## No id variables; using all as measure variables

Look to that. As the number of loans by month increases, the median of interest rate decreases, as well as the percentage Defaulted ratio. Right now, I can’t say why , although I can think in a reason for this behaviour. For instance, in finance, if you want to reduce the risk of your investments, you should diversify them. If I suppose that the interest rate is some function of the defaulted ratio, and then I diversified to whom I lend money, I would expect that the defaulted ratio went down and then I could reduce the interest rate.

In this section, I focused on understanding what makes someone be classified as risky. I have explored many variables. Many of them presented some correlation between what I was looking for, but not strong enough to allow me to formulate some hypotheses.

I have learned that loans with worst risk scores are, in fact, risky. They present a bigger defaulted ratio. As expected, I also have found that better risk classification is related to smaller interest rates and the amount that someone borrows.

Also, I verified that the more deep in debt someone is, related to his income, the more expensive is to take loans. My analysis has shown that the level of debt from someone is related to his level of income. I also have saw that when someone has a house, the average interest rate is smaller.

In the next section, I will come back to the loans with 37%-38% of interest rate and how the term variable relates to what I already have found.

Multivariate Section

-Relationship observed -Any surprise?

Histograms Revisited

Notes:

One clue to the discritness of data is that the 3rd percentile is the same of the maximum value

# 
# yo <- read.csv('yogurt.csv')
# str(yo)
# 
# #change the id from an int to a factor
# yo$id <- factor(yo$id)
# str(yo)
# 
# ggplot(data = yo, aes(x = price)) +
#   geom_histogram()
# 
# ggplot(data = yo, aes(x = price)) +
#   geom_histogram(binwidth=10)
# 
# 
# summary(yo$price)
# 
# unique(yo$price)
# 
# length(unique(yo$price))

Final Plots and Summary

Reflection

bla